Abstract

A new algorithm for the solution of linear recurrence systems on
parallel or pipelined computers is described. Time bounds, speedup and
efficiency are obtained for SIMD and MIMD computers with a fixed number
of processing elements, as well as for pipelined computers with a fixed
number of stages per operation. The model of each computer is discussed
in detail to explain the better performance of the pipelined model. A
simple modification in the design of processing elements for parallel
computers makes the parallel model superior.


1.        Introduction

In  this  paper  a  new  algorithm  for  the  fast  solution  of  linear 
recurrence  systems  is  presented.   The  same  algorithm  in  two  different 
implementations  can  be  used  on  either  parallel  or  pipelined  computers. 

Parallel linear recurrence solvers have been discussed previously
[1-7]. The algorithm presented in this paper is faster than previous
algorithms on parallel computers with a fixed number of processors.
The solution of linear recurrence systems on pipelined computers has
not been discussed previously.

In Section 2, the definitions and notation used throughout this paper
are presented. Then the Main Theorem, which forms the basis of the
algorithm, is proved.

In Section 3, the algorithm for SIMD and MIMD parallel computers using
p Arithmetic Elements (AEs) is developed. The speedup of the parallel
algorithm is denoted by $S_p = T_1/T_p$, where $T_1$ and $T_p$ are the
computational times for linear recurrence systems using 1 and p
processors, respectively. Similarly, the efficiency is denoted by
$E_p = S_p/p$.

A model of a pipelined computer is described in Section 4. Each
functional unit (multiplier, adder, ...) in the model has s stages. It is
assumed that functional units can be chained together. It is shown that
a variation of the algorithm for parallel machines can be used for
pipelined computers too. The computational time $T_s$, speedup $S_s$ and
efficiency $E_s$ are defined in analogy with parallel computers. It is
further assumed that each pipelined functional unit requires one time
step for a single operation, although it can deliver s results in one
time step in the vector mode.


2.        Main  Theorem 

Definition 1       An m-th order linear recurrence system R<n,m> is
the set of n equations

$$ x_i = \mathbf{a}_i * \mathbf{x}_{i-1}^T , \qquad 1 \le i \le n \tag{1} $$

where * denotes matrix multiplication and where
$\mathbf{a}_i = (a_{i1}, a_{i2}, \ldots, a_{im}, c_i)$ is the vector¹ of
coefficients, $\mathbf{x}_{i-1} = (x_{i-1}, x_{i-2}, \ldots, x_{i-m}, 1)$ is
the vector of variables, and the initial vector
$\mathbf{x}_0 = (x_0, x_{-1}, \ldots, x_{-m+1}, 1)$ is specified in advance.


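Furthermore, we will need a set of m+1 unit vectors
$\mathbf{e}_j = (e_{j1}, e_{j2}, \ldots, e_{j,m+1})$, $1 \le j \le m+1$, where

$$ e_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise.} \end{cases} \tag{2} $$

If a row (column) of any matrix is equal to a unit vector, it is
called a trivial row (column).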

Theorem  1 

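Any R<n,m> system can be written in the form

$$ x_i = \mathbf{b}_i * \mathbf{x}_0^T \tag{3} $$

where for all i, $1 \le i \le n$,
$\mathbf{b}_i = (b_{i1}, b_{i2}, \ldots, b_{im}, d_i)$ is defined
recursively as follows:

$$ \mathbf{b}_i = \mathbf{a}_i * \mathbf{B}_{i-1} , \qquad 1 \le i \le n \tag{4} $$

with matrix
$\mathbf{B}_{i-1} = (\mathbf{b}_{i-1}, \mathbf{b}_{i-2}, \ldots, \mathbf{b}_{i-m}, \mathbf{e}_{m+1})^T$,
and $\mathbf{b}_0, \mathbf{b}_{-1}, \ldots, \mathbf{b}_{-m+1}$ equal to
$\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m$.

¹ Throughout this paper any vector $\mathbf{x}$ is a row vector and
$\mathbf{x}^T$ is the corresponding column vector.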


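Proof (by mathematical induction on i)

(Basis)

$$ \begin{aligned}
x_1 &= \mathbf{a}_1 * \mathbf{x}_0^T && \text{by Def. 1} \\
    &= (\mathbf{a}_1 * (\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m, \mathbf{e}_{m+1})^T) * \mathbf{x}_0^T && \text{by def. of identity matrix} \\
    &= (\mathbf{a}_1 * (\mathbf{b}_0, \mathbf{b}_{-1}, \ldots, \mathbf{b}_{-m+1}, \mathbf{e}_{m+1})^T) * \mathbf{x}_0^T && \text{by def. of } \mathbf{b}_0, \ldots, \mathbf{b}_{-m+1} \\
    &= \mathbf{b}_1 * \mathbf{x}_0^T && \text{by def. of } \mathbf{b}_1
\end{aligned} $$

(Induction step)

$$ \begin{aligned}
x_{i+1} &= \mathbf{a}_{i+1} * \mathbf{x}_i^T && \text{by Def. 1} \\
        &= \mathbf{a}_{i+1} * ((\mathbf{b}_i, \mathbf{b}_{i-1}, \ldots, \mathbf{b}_{i-m+1}, \mathbf{e}_{m+1})^T * \mathbf{x}_0^T) && \text{by hypothesis} \\
        &= \mathbf{b}_{i+1} * \mathbf{x}_0^T && \text{by def. of } \mathbf{b}_{i+1}
\end{aligned} $$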
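As an illustration of Theorem 1, the following Python sketch evaluates a small system both serially, from Definition 1, and through the parallel coefficients of equation (4). The function names `solve_direct` and `parallel_coefficients` and the sample coefficients are assumptions made for this example; they do not come from the paper.

```python
def solve_direct(a, x0):
    """Serial evaluation of R<n,m>: x_i = a_i * x_{i-1}^T (Definition 1).

    a  : list of n coefficient vectors (a_i1, ..., a_im, c_i)
    x0 : initial vector (x_0, x_-1, ..., x_-m+1, 1)
    """
    m = len(x0) - 1
    state = list(x0)
    xs = []
    for ai in a:
        xi = sum(c * v for c, v in zip(ai, state))
        xs.append(xi)
        state = [xi] + state[:m - 1] + [1]  # shift the new value in
    return xs


def parallel_coefficients(a, m):
    """b_i = a_i * B_{i-1}, starting from B_0 = identity (equation (4))."""
    size = m + 1
    identity = [[1 if r == c else 0 for c in range(size)] for r in range(size)]
    rows = identity[:m]   # b_0, b_-1, ..., b_-m+1 = e_1, ..., e_m
    last = identity[m]    # e_{m+1}, always the last row of B
    bs = []
    for ai in a:
        B = rows + [last]
        bi = [sum(ai[r] * B[r][c] for r in range(size)) for c in range(size)]
        bs.append(bi)
        rows = [bi] + rows[:m - 1]
    return bs


# Hypothetical R<6,2> with integer coefficients.
a = [[1, 2, 3], [2, -1, 0], [0, 1, 1], [3, 1, -2], [1, 1, 1], [2, 0, 5]]
x0 = [1, -1, 1]
bs = parallel_coefficients(a, 2)
via_b = [sum(b * v for b, v in zip(bi, x0)) for bi in bs]
```

Once all `bs` are known, every entry of `via_b` depends only on `x0`, which is the independence that the theorem provides.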


Theorem 1 shows that any set of consecutive variables
$\{x_i \mid j \le i \le k\}$ can be computed in parallel, that is,
independently of each other, if the vectors of parallel coefficients
$\mathbf{b}_i$, $j \le i \le k$, are generated and substituted for
$\mathbf{a}_i$, $j \le i \le k$. The computation of all $\mathbf{b}_i$ is the
overhead that must be paid for parallelism. The following example should
clarify this idea.

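Let us consider the linear recurrence system R<9,2>. By Definition 1,
R<9,2> can be written as

$$ x_i = a_{i1} x_{i-1} + a_{i2} x_{i-2} + c_i , \qquad 1 \le i \le 9 . $$

Using (4), the equations for $x_5, \ldots, x_8$ can be transformed into

$$ x_i = b_{i1} x_4 + b_{i2} x_3 + d_i , \qquad 5 \le i \le 8 , $$

where

$$ [b_{51}, b_{52}, d_5] = [a_{51}, a_{52}, c_5] *
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= [a_{51}, a_{52}, c_5] $$

$$ [b_{61}, b_{62}, d_6] = [a_{61}, a_{62}, c_6] *
\begin{bmatrix} b_{51} & b_{52} & d_5 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= [a_{61} a_{51} + a_{62},\ a_{61} a_{52},\ a_{61} c_5 + c_6] $$

$$ [b_{71}, b_{72}, d_7] = [a_{71}, a_{72}, c_7] *
\begin{bmatrix} b_{61} & b_{62} & d_6 \\ b_{51} & b_{52} & d_5 \\ 0 & 0 & 1 \end{bmatrix} $$
$$ = [a_{71}(a_{61} a_{51} + a_{62}) + a_{72} a_{51},\
      a_{71} a_{61} a_{52} + a_{72} a_{52},\
      a_{71}(a_{61} c_5 + c_6) + a_{72} c_5 + c_7] $$

$$ [b_{81}, b_{82}, d_8] = [a_{81}, a_{82}, c_8] *
\begin{bmatrix} b_{71} & b_{72} & d_7 \\ b_{61} & b_{62} & d_6 \\ 0 & 0 & 1 \end{bmatrix} $$
$$ = [a_{81}(a_{71}(a_{61} a_{51} + a_{62}) + a_{72} a_{51}) + a_{82}(a_{61} a_{51} + a_{62}), $$
$$ \quad a_{81}(a_{71} a_{61} a_{52} + a_{72} a_{52}) + a_{82} a_{61} a_{52}, $$
$$ \quad a_{81}(a_{71}(a_{61} c_5 + c_6) + a_{72} c_5 + c_7) + a_{82}(a_{61} c_5 + c_6) + c_8] $$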

The above coefficients can be obtained by direct substitution of $x_5$
into $x_6$, $x_5$ and $x_6$ into $x_7$, and finally $x_5$, $x_6$ and $x_7$
into $x_8$. The transformed system allows parallel computation of $x_5$,
$x_6$, $x_7$ and $x_8$, since they no longer depend on each other. It may
seem that nothing has been accomplished, since the computation of
$\mathbf{b}_i$, $5 \le i \le 8$, must be executed serially and requires more
time than the original recurrence. However, the generation of
$\mathbf{b}_i$, $5 \le i \le 8$, is independent of any other subset of
variables whose intersection with $\{x_i \mid 5 \le i \le 8\}$ is empty, and
therefore the generation of the $\mathbf{b}_i$'s for any two disjoint
subsets of variables can be performed concurrently.
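The expanded coefficients of the example can be checked numerically. The sketch below applies (4) four times for m = 2, starting from the identity matrix at $x_4$, and compares the result with the expanded expression for $[b_{81}, b_{82}, d_8]$. The coefficient values and the helper name `step` are arbitrary choices made for illustration.

```python
def step(a_i, rows):
    """One application of (4) for m = 2: b_i = a_i * (rows[0], rows[1], e_3)^T."""
    e3 = (0, 0, 1)
    B = (rows[0], rows[1], e3)
    return tuple(sum(a_i[r] * B[r][c] for r in range(3)) for c in range(3))


# Arbitrary illustrative coefficients (a_i1, a_i2, c_i) for i = 5..8.
a5, a6, a7, a8 = (2, 3, 1), (1, -1, 4), (5, 2, -3), (-2, 1, 7)
e1, e2 = (1, 0, 0), (0, 1, 0)

b5 = step(a5, (e1, e2))   # B_4 is the identity matrix
b6 = step(a6, (b5, e1))
b7 = step(a7, (b6, b5))
b8 = step(a8, (b7, b6))

(a51, a52, c5), (a61, a62, c6) = a5, a6
(a71, a72, c7), (a81, a82, c8) = a7, a8
b8_expanded = (
    a81 * (a71 * (a61 * a51 + a62) + a72 * a51) + a82 * (a61 * a51 + a62),
    a81 * (a71 * a61 * a52 + a72 * a52) + a82 * a61 * a52,
    a81 * (a71 * (a61 * c5 + c6) + a72 * c5 + c7) + a82 * (a61 * c5 + c6) + c8,
)
```

The last component only agrees when the term $a_{81} c_7$ is present, since $d_8 = a_{81} d_7 + a_{82} d_6 + c_8$ and $d_7$ contains $c_7$.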

Corollary  1 

In any R<n,m> the computation of

$$ \mathbf{b}_i = \mathbf{a}_i * \mathbf{B}_{i-1} , \qquad 1 \le i \le n , $$

requires at most m(m+1) multiplications and $(m-1)(m+1) + 1 = m^2$
additions.

Furthermore, the computation of

$$ x_i = \mathbf{b}_i * \mathbf{x}_0^T , \qquad 1 \le i \le n , $$

requires at most m multiplications and m additions.

■ 

Corollary  2 

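In any R<n,m> the computation of all
$\mathbf{b}_i = \mathbf{a}_i * \mathbf{B}_{i-1}$, $1 \le i \le n$, requires
no more than

$$ K_m(n) = \begin{cases}
\left(n - \tfrac{1}{2}(m+1)\right) m(m+1) & \text{if } n \ge m+1 \\
\tfrac{1}{2}\, n(n-1)(m+1) & \text{if } n \le m+1
\end{cases} \tag{5} $$

multiplications, and

$$ K_a(n) = \begin{cases}
\left(n - \tfrac{1}{2}(m+1)\right) m^2 & \text{if } n \ge m+1 \\
\tfrac{1}{2}\, n(n-1)\, m & \text{if } n \le m+1
\end{cases} \tag{6} $$

additions.

Furthermore, the total number of operations required is equal to

$$ K(n) = K_m(n) + K_a(n) = \begin{cases}
\left(n - \tfrac{m+1}{2}\right)(2m^2 + m) & \text{if } n \ge m+1 \\
\tfrac{1}{2}\, n(n-1)(2m+1) & \text{if } n \le m+1
\end{cases} \tag{7} $$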

Proof

For all i, $1 \le i \le m+1$, $\mathbf{B}_{i-1}$ contains only i-1
nontrivial rows, so that $\mathbf{b}_i = \mathbf{a}_i * \mathbf{B}_{i-1}$
requires only (i-1)(m+1) multiplications. Therefore, for all $n \le m+1$

$$ K_m(n) = \sum_{i=1}^{n} (i-1)(m+1) = (m+1) \sum_{i=1}^{n} (i-1)
          = (m+1)\, \frac{n(n-1)}{2} . \tag{8} $$

On the other hand, for $n \ge m+1$, $K_m(n)$ consists of two parts. Using
(8), the number of multiplications to compute all $\mathbf{b}_i$,
$1 \le i \le m$, is equal to $\tfrac{1}{2}(m+1)\, m(m-1)$. Each consecutive
$\mathbf{b}_i$, $m+1 \le i \le n$, requires m(m+1) multiplications by
Corollary 1. Therefore, for all $n \ge m+1$

$$ \begin{aligned}
K_m(n) &= (n-m)\, m(m+1) + \tfrac{1}{2}(m+1)\, m(m-1) \\
       &= n\, m(m+1) - m^2(m+1) + \tfrac{1}{2}(m+1)\, m(m-1) \\
       &= n\, m(m+1) - \tfrac{1}{2}\, m(m+1)^2 .
\end{aligned} $$

Each nontrivial row, not counting the first, requires m+1 additions and
each trivial row only one addition, under the assumption that the first
row in $\mathbf{B}_{i-1}$ is nontrivial. If the first row is trivial, then
i = 1 and $\mathbf{B}_{i-1} = \mathbf{B}_0$ is an identity matrix. No
additions are required in the computation of
$\mathbf{b}_1 = \mathbf{a}_1 * \mathbf{B}_0 = \mathbf{a}_1$. Therefore, for
all $n \le m+1$

$$ K_a(n) = \sum_{i=2}^{n} [(i-2)(m+1) + (m+1) - (i-1)]
          = \sum_{i=2}^{n} m(i-1) = \tfrac{1}{2}\, m\, n(n-1) . $$

Similarly, $K_a(n)$, $n \ge m+1$, consists of two parts, since each
$\mathbf{b}_i$, $m+1 \le i \le n$, requires $m^2$ additions by Corollary 1
and the computation of the remaining $\mathbf{b}_i$, $1 \le i \le m$,
altogether requires $\tfrac{1}{2}\, m^2(m-1)$ additions. Thus

$$ K_a(n) = (n-m)\, m^2 + \tfrac{1}{2}\, m^2(m-1)
          = n\, m^2 - \tfrac{1}{2}\, m^2(m+1) . $$
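The closed forms (5) and (6) can be cross-checked against a direct sum of the per-step counts used in the proof. The sketch below is only a consistency check. It takes from the proof the assumption that $\mathbf{B}_{i-1}$ has min(i-1, m) nontrivial rows, and the function names are chosen for this example.

```python
def step_counts(i, m):
    """Multiplications and additions for b_i = a_i * B_{i-1}.

    Follows the proof: B_{i-1} has min(i-1, m) nontrivial rows, each costing
    m+1 multiplications; the additions then total m per nontrivial row.
    """
    nontrivial = min(i - 1, m)
    return nontrivial * (m + 1), nontrivial * m


def K_closed(n, m):
    """Closed forms (5) and (6), kept in exact integer arithmetic."""
    if n >= m + 1:
        km = (2 * n - (m + 1)) * m * (m + 1) // 2
        ka = (2 * n - (m + 1)) * m * m // 2
    else:
        km = n * (n - 1) * (m + 1) // 2
        ka = n * (n - 1) * m // 2
    return km, ka


def K_summed(n, m):
    """Totals obtained by summing the per-step counts over 1 <= i <= n."""
    counts = [step_counts(i, m) for i in range(1, n + 1)]
    return sum(c[0] for c in counts), sum(c[1] for c in counts)
```

For n = 4 and m = 2 both routes give 15 multiplications and 10 additions, that is, K(4) = 25 operations in total.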




3.        Parallel-Processor  System 

The model of a parallel computer (Fig. 1) that we will consider consists
of Parallel Memory (PM), Parallel Processor (PP) and Control Unit (CU).
PP has p Arithmetic Elements (AEs). Each AE can perform any arithmetic
operation in one time step (clock). All AEs are assumed to perform the
same operation on each time step (SIMD computer). It is assumed further
that no time is required to communicate data between processors and
memories, and that the storage arrangement of data is irrelevant.

Algorithm  1 

Given a linear recurrence system R<n,m>, divide R<n,m> into
$\lceil n/p \rceil$ subsystems $R^{(j)}$<p,m>, where
$1 \le j \le \lceil n/p \rceil$:

1.  begin
2.      $\mathbf{x}_p^{(0)} := \mathbf{x}_0$ ;
3.      for j := 1 until $\lceil n/p \rceil$ do
4.      begin
5.          $\mathbf{B}_0^{(j)}$ := identity matrix ;
6.          for k := 1 until p do $\mathbf{b}_k^{(j)} := \mathbf{a}_k^{(j)} * \mathbf{B}_{k-1}^{(j)}$ ;
7.      end
8.      for j := 1 until $\lceil n/p \rceil$ do
9.          for k := 1 until p do $x_k^{(j)} := \mathbf{b}_k^{(j)} * \mathbf{x}_p^{(j-1)T}$ ;
10. end
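A sequential Python simulation of Algorithm 1 may help in following the two loops. It is a minimal sketch and not a parallel implementation: the comments mark which loops the algorithm treats as independent. The names `serial_solve` and `algorithm1` and the test coefficients are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(s * t for s, t in zip(u, v))


def serial_solve(a, x0):
    """Reference solution: x_i = a_i * x_{i-1}^T evaluated one step at a time."""
    m = len(x0) - 1
    state, xs = list(x0), []
    for ai in a:
        xs.append(dot(ai, state))
        state = [xs[-1]] + state[:m - 1] + [1]
    return xs


def algorithm1(a, x0, p):
    """Simulate Algorithm 1 with subsystems of p equations each."""
    m = len(x0) - 1
    size = m + 1
    identity = [[int(r == c) for c in range(size)] for r in range(size)]
    blocks = [a[s:s + p] for s in range(0, len(a), p)]  # subsystems R^(j)<p,m>

    # Lines 3-7: coefficients b_k^(j); different blocks are independent.
    all_b = []
    for block in blocks:
        rows, bs = identity[:m], []
        for ak in block:
            B = rows + [identity[m]]
            bk = [sum(ak[r] * B[r][c] for r in range(size)) for c in range(size)]
            bs.append(bk)
            rows = [bk] + rows[:m - 1]
        all_b.append(bs)

    # Lines 8-9: inside a block every x_k needs only the previous block's
    # final vector, so the inner evaluations are independent of each other.
    state, xs = list(x0), []
    for bs in all_b:
        block_x = [dot(bk, state) for bk in bs]
        xs.extend(block_x)
        state = (block_x[::-1] + state[:m])[:m] + [1]
    return xs


# Hypothetical R<10,2> used only to exercise the code.
a = [[(7 * i) % 5 - 2, (3 * i) % 4 - 1, i % 3] for i in range(1, 11)]
x0 = [1, 2, 1]
```

The state update keeps the m most recent variables, so the sketch also covers a final block shorter than p and the case p < m.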


An application of Algorithm 1 to system R<28,2> is shown in Fig. 2. A
parallel computer with four AEs is used.

Fig. 1.   A model of a parallel computer

In general, a square with the inverse diagonal symbolizes the computation
of $\mathbf{b}_i^{(j)} = \mathbf{a}_i^{(j)} * \mathbf{B}_{i-1}^{(j)}$,
$1 \le i \le 4$, $1 \le j \le 7$. The input on the top of the square
represents $\mathbf{a}_i$ and the output on the bottom $\mathbf{b}_i$. The
horizontal input represents the matrix $\mathbf{B}_{i-1}$, whose rows are
the vectors $\mathbf{b}_{i-1}, \mathbf{b}_{i-2}, \ldots, \mathbf{e}_{m+1}$.
The dots on the horizontal line indicate the vectors that are rows in
$\mathbf{B}_{i-1}$. The unit vector $\mathbf{e}_{m+1}$, which is always the
last row of any matrix $\mathbf{B}_{i-1}$, is not connected for clarity.
The closest dot to the square represents the top row of
$\mathbf{B}_{i-1}$, that is, $\mathbf{b}_{i-1}$. The natural ordering of
the rows in the matrix is established by proximity to the square.
Similarly, the empty square represents the computation of
$x_i^{(j)} = \mathbf{b}_i^{(j)} * \mathbf{x}_p^{(j-1)T}$, with the vertical
input being $\mathbf{b}_i^{(j)}$, the vertical output $x_i^{(j)}$, and the
horizontal input representing the vector $\mathbf{x}_p^{(j-1)}$.

Each of the four AEs performs the computation indicated by approximately
1/4 of the squares. AEs should be assigned to squares in a way that
minimizes computational time and simplifies memory access and
communication. The assignment imposed by Algorithm 1 for R<28,2> is
shown in Table 1. Note that the upper part of Table 1 represents
redundant computation that is not required on a serial computer.

Table 1.   AE assignment for R<28,2>

Theorem 2

The number of time steps required to compute a linear recurrence system
R<n,m> using a parallel computer with p AEs is

$$ T_p = \begin{cases}
\dfrac{n}{p}\left((2m^2 + 3m) - \dfrac{m+1}{2p}(2m^2 + m)\right) & \text{if } p \ge m+1 \\[1ex]
n\left(\left(m + \tfrac{1}{2}\right) + \dfrac{2m-1}{2p}\right) & \text{if } p \le m+1
\end{cases} $$

Proof


The proof is based on Algorithm 1. The solution consists of two parts,
described by the two DO-loops in lines 6 and 9, respectively.

Firstly, using only one AE, all $\mathbf{b}_k^{(j)}$'s, $1 \le k \le p$,
are computed for some $R^{(j)}$<p,m>. Since there are p AEs, p subsystems
can be computed at the same time. The same procedure (DO-loop in line 6)
is repeated $\lceil n/p^2 \rceil$ times. Note that the DO-loop in line 3
can be considered to be a vector operation that is executed in parallel
in slices of p elements each. On the other hand, the DO-loop in line 6
is executed serially.

Secondly, using all p AEs, all $x_k^{(j)}$, $1 \le k \le p$, are evaluated
in parallel for each $R^{(j)}$<p,m>. This time, the DO-loop in line 8 is
executed serially and the one in line 9 as a vector operation in slices
of size p.

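The number, T', of time steps required to compute the DO-loop in line 6
using one AE is given by Corollary 2

$$ T' = \begin{cases}
(p - (m+1)/2)(2m^2 + m) & \text{if } p \ge m+1 \\
p(p-1)(2m+1)/2 & \text{if } p \le m+1
\end{cases} $$

Furthermore, the time T'' needed to compute the DO-loop in line 9 using
p AEs is given by Corollary 1

$$ T'' = 2m . $$

Therefore, the total number of time steps $T_p$ needed to compute R<n,m>
is

$$ \begin{aligned}
T_p &= \frac{n}{p^2}\, T' + \frac{n}{p}\, T'' \\
    &= \begin{cases}
       \dfrac{n}{p}(2m^2 + m) - \dfrac{n}{p^2}\,\dfrac{m+1}{2}(2m^2 + m) + \dfrac{n}{p}\, 2m \\[1ex]
       \dfrac{n}{2p}(p-1)(2m+1) + \dfrac{n}{p}\, 2m
       \end{cases} \\
    &= \begin{cases}
       \dfrac{n}{p}\left((2m^2 + 3m) - \dfrac{m+1}{2p}(2m^2 + m)\right) & \text{if } p \ge m+1 \\[1ex]
       n\left(\left(m + \tfrac{1}{2}\right) + \dfrac{2m-1}{2p}\right) & \text{if } p \le m+1
       \end{cases}
\end{aligned} $$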
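The time bound of Theorem 2 can be evaluated exactly with rational arithmetic. The sketch below computes $T_p$ from $T'$ and $T''$ as in the proof, compares it with the closed form, and derives the speedup $S_p = T_1/T_p$. It assumes $T_1 = 2mn$, that is, m multiplications and m additions per variable on one processor. The function names are chosen for this example.

```python
from fractions import Fraction


def K(p, m):
    """T': total operations of Corollary 2 for a subsystem of p equations."""
    if p >= m + 1:
        return Fraction(2 * p - (m + 1), 2) * (2 * m * m + m)
    return Fraction(p * (p - 1) * (2 * m + 1), 2)


def simd_time(n, m, p):
    """T_p = (n / p^2) T' + (n / p) T'' with T'' = 2m."""
    return Fraction(n, p * p) * K(p, m) + Fraction(n, p) * 2 * m


def simd_time_closed(n, m, p):
    """Closed form of T_p for the two cases of Theorem 2."""
    if p >= m + 1:
        inner = (2 * m * m + 3 * m) - Fraction(m + 1, 2 * p) * (2 * m * m + m)
        return Fraction(n, p) * inner
    return n * (Fraction(2 * m + 1, 2) + Fraction(2 * m - 1, 2 * p))


def speedup(n, m, p):
    """S_p = T_1 / T_p, assuming the serial time T_1 = 2mn."""
    return Fraction(2 * m * n) / simd_time(n, m, p)
```

For example, `speedup(28, 1, 2)` evaluates to 8/7, independently of n, because both $T_1$ and $T_p$ are proportional to n in this model.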


The computational time $T_p$ given by Theorem 2 can be improved if the
restriction of a SIMD model is removed; that is, the assumption that all
AEs perform the same operation at the same time. In the MIMD model that
we shall consider, (p-1) AEs still perform the same function at the same
time, while the only remaining AE may not. The idea can be easily
explained using Fig. 2. The system $R^{(1)}$<4,2> does not require
computation of the parallel coefficients $\mathbf{b}_1$, $\mathbf{b}_2$,
$\mathbf{b}_3$ and $\mathbf{b}_4$. Since the initial vector $\mathbf{x}_0$
is known, the variables $x_1$, $x_2$, $x_3$ and $x_4$ can be computed
directly from $\mathbf{x}_0$, that is,
$x_i = \mathbf{a}_i * \mathbf{x}_{i-1}^T$, $1 \le i \le 4$. Furthermore,
the number of time steps 2pm = 16 is less than the time needed to compute
all parallel coefficients $\mathbf{b}_k$, $1 \le k \le 4$, for any
$R^{(j)}$<p,m>, $2 \le j \le 4$. Since the time to compute the parallel
coefficients is $(p - \frac{m+1}{2})(2m^2 + m) = 25$, the system
$R^{(1)}$<4,2> could be enlarged to $R^{(1)}$<6,2> and still be solved in
24 time steps using only one AE. This is shown in Fig. 3. The same trick
can be employed after every p(p-1) + r variables, where r is the size of
the first subsystem $R^{(1)}$<r,m>. At that moment the values of $x_{18}$
and $x_{17}$ are known and there is no need to compute the parallel
coefficients $\mathbf{b}_{19}$, $\mathbf{b}_{20}$, $\mathbf{b}_{21}$,
$\mathbf{b}_{22}$, $\mathbf{b}_{23}$ and $\mathbf{b}_{24}$.

Algorithm  2 

Given a linear recurrence system R<n,m>, divide R<n,m> into subsystems
$R^{(i)}$<p(p-1)+r, m>, $1 \le i \le \lceil n/(p(p-1)+r) \rceil$. Divide
furthermore each $R^{(i)}$<p(p-1)+r, m> into p subsystems:
$R^{(i,1)}$<r,m>, where $r = \lfloor (p - (m+1)/2)(2m+1)/2 \rfloor$, and
$R^{(i,j)}$<p,m>, $2 \le j \le p$.

1.  begin
2.      $\mathbf{x}_p^{(0,p)} := \mathbf{x}_0$ ;
3.      for i := 1 until $\lceil n/(p(p-1)+r) \rceil$ do
4.      begin
5.          $\mathbf{x}_0^{(i,1)} := \mathbf{x}_p^{(i-1,p)}$ ;
6.          for k := 1 until r do $x_k^{(i,1)} := \mathbf{a}_k^{(i,1)} * \mathbf{x}_{k-1}^{(i,1)T}$ ;
7.          for j := 2 until p do
8.          begin
9.              $\mathbf{B}_0^{(i,j)}$ := identity matrix ;
10.             for k := 1 until p do $\mathbf{b}_k^{(i,j)} := \mathbf{a}_k^{(i,j)} * \mathbf{B}_{k-1}^{(i,j)}$ ;
11.         end
12.         for k := 1 until p do $x_k^{(i,2)} := \mathbf{b}_k^{(i,2)} * \mathbf{x}_r^{(i,1)T}$ ;
13.         for j := 3 until p do
14.             for k := 1 until p do $x_k^{(i,j)} := \mathbf{b}_k^{(i,j)} * \mathbf{x}_p^{(i,j-1)T}$ ;
15.     end
16. end
Theorem 3

The number of time steps required to compute a linear recurrence system
R<n,m> using a MIMD computer with p AEs is

$$ T_p \le \begin{cases}
\dfrac{n}{p + m + 1/2}\,(2m^2 + 3m) & \text{if } p \ge m+1 \\[1ex]
\dfrac{2mn}{p(6m+1)}\,(p(2m+1) + 4m) & \text{if } p \le m+1
\end{cases} $$

Proof 

The proof is based on Algorithm 2. A recurrence system R<n,m> is divided
into systems $R^{(i)}$<p(p-1)+r, m>,
$1 \le i \le \lceil n/(p(p-1)+r) \rceil$. Each $R^{(i)}$<p(p-1)+r, m> is
further subdivided into p subsystems: $R^{(i,1)}$<r,m> and
$R^{(i,j)}$<p,m>, $2 \le j \le p$. With p AEs in a MIMD computer, one AE
is used to solve $R^{(i,1)}$<r,m> in 2rm time steps. The variables are
computed sequentially:
$x_k^{(i,1)} = \mathbf{a}_k^{(i,1)} * \mathbf{x}_{k-1}^{(i,1)T}$. Each of
the remaining (p-1) processors is used to compute the parallel
coefficients
$\mathbf{b}_k^{(i,j)} = \mathbf{a}_k^{(i,j)} * \mathbf{B}_{k-1}^{(i,j)}$,
$1 \le k \le p$, in some subsystem $R^{(i,j)}$<p,m>. The number of time
steps required is $(p - \frac{m+1}{2})(2m^2 + m)$ if $p \ge m+1$, or
$p(p-1)(2m+1)/2$ if $p \le m+1$, by Corollary 2.

Assuming that the solution of $R^{(i,1)}$<r,m> takes at most as much time
as the computation of all parallel coefficients in any $R^{(i,j)}$<p,m>,
it is easy to find the value of r. Hence,

$$ r = \begin{cases}
\lfloor (p - \frac{m+1}{2})(2m+1)/2 \rfloor & \text{if } p \ge m+1 \\
\lfloor p(p-1)(2m+1)/4m \rfloor & \text{if } p \le m+1
\end{cases} $$

After all parallel coefficients for all subsystems have been computed,
all p processors are used to compute
$x_k^{(i,2)} = \mathbf{b}_k^{(i,2)} * \mathbf{x}_r^{(i,1)T}$, and
consecutively
$x_k^{(i,j)} = \mathbf{b}_k^{(i,j)} * \mathbf{x}_p^{(i,j-1)T}$ for all j,
$3 \le j \le p$. This can be accomplished in 2m(p-1) time steps.
Therefore, for $p \ge m+1$,

$$ \begin{aligned}
T_p &\le \frac{n}{p(p-1) + r}\left(\left(p - \tfrac{m+1}{2}\right)(2m^2 + m) + 2(p-1)m\right) \\
    &\le \frac{np}{p(p-1) + r}\left((2m^2 + 3m) - \frac{m+1}{2p}(2m^2 + m)\right) \\
    &\le \frac{n}{p + m + 1/2}\,(2m^2 + 3m) ,
\end{aligned} $$

and for $p \le m+1$,

$$ \begin{aligned}
T_p &\le \frac{n}{p(p-1) + r}\left(p(p-1)(2m+1)/2 + 2(p-1)m\right) \\
    &= \frac{n(p-1)}{(p-1)\left(p + \frac{r}{p-1}\right)}\left(\tfrac{1}{2}\, p(2m+1) + 2m\right) \\
    &\le \frac{2mn}{p(6m+1)}\,(p(2m+1) + 4m) .
\end{aligned} $$
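The quantities used in the MIMD argument can be computed directly. The sketch below evaluates the coefficient time from Corollary 2, the size r of the first subsystem, and the time for the groups of p(p-1) + r variables, without rounding the number of groups up to an integer. It reproduces the figures of the R<28,2> discussion for p = 4, m = 2. The function names are assumptions of this example, and the compact bound of Theorem 3 is deliberately not evaluated here.

```python
from fractions import Fraction
from math import floor


def coeff_time(p, m):
    """Time for one AE to compute all b_k of a subsystem R<p,m> (Corollary 2)."""
    if p >= m + 1:
        return Fraction(2 * p - (m + 1), 2) * (2 * m * m + m)
    return Fraction(p * (p - 1) * (2 * m + 1), 2)


def first_block_size(p, m):
    """Largest r such that the serial time 2rm does not exceed coeff_time(p, m)."""
    return floor(coeff_time(p, m) / (2 * m))


def mimd_time(n, m, p):
    """n / (p(p-1) + r) groups, each costing coeff_time + 2m(p-1) steps."""
    if p < 2:
        raise ValueError("the MIMD scheme needs at least two AEs")
    r = first_block_size(p, m)
    per_group = coeff_time(p, m) + 2 * m * (p - 1)
    return Fraction(n, p * (p - 1) + r) * per_group
```

With p = 4 and m = 2 the coefficient time is 25, so r = 6 and the first subsystem takes 2rm = 24 steps, as stated in the text; one group of 18 variables then takes 25 + 12 = 37 steps.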
4.        Pipelined-Processor System

A pipelined-processor model consists of Main Memory (MM), Pipelined
Processor (PP) and Control Unit (CU). PP may consist of several
pipelined functional units. In our model, we shall assume only 2
functional units: a multiplication pipe and an addition pipe. Each of
them has s stages. These two functional units are connected serially,
with the multiplication pipe feeding the addition pipe through register
R1. The results from the addition pipe are either stored in MM or fed
back into the addition pipe through register R2 (Fig. 4).

Algorithm  3 

Given a linear recurrence system R<n,m>, divide R<n,m> into
$\lceil n/p \rceil$ subsystems $R^{(j)}$<p,m>, where
$1 \le j \le \lceil n/p \rceil$. Execute the following algorithm:

1.  begin
2.      for i := 0 until $p^2 - 1$ do $\mathbf{a}_{-i}$ := null ;
3.      for i := 0 until $p^2 - 1$ do $\mathbf{a}_{n+i+1}$ := null ;
4.      for j := 1 until $\lceil n/p \rceil + p$ do
5.      begin
6.          $\mathbf{B}_0^{(j)}$ := identity matrix ;
7.          for k := 1 until p do $\mathbf{b}_k^{(j-k+1)} := \mathbf{a}_k^{(j-k+1)} * \mathbf{B}_{k-1}^{(j-k+1)}$ ;
8.          for k := 1 until p do $x_{p-k+1}^{(j-p)} := \mathbf{b}_{p-k+1}^{(j-p)} * \mathbf{x}_p^{(j-p-1)T}$ ;
9.      end
10. end
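The interleaving produced by lines 7 and 8 is easier to see when the issue order is listed explicitly. The sketch below generates that order in terms of global variable indices, assuming that subsystem j holds the variables p(j-1)+1, ..., pj and that references outside the valid range correspond to the null padding of lines 2 and 3. The function name `pipeline_schedule` is an assumption of this example.

```python
from math import ceil


def pipeline_schedule(n, p):
    """Order in which Algorithm 3 issues coefficient and variable computations.

    Returns a list of ("b", index) and ("x", index) pairs with global indices.
    Work on subsystems outside 1..ceil(n/p) is skipped, which models the
    null padding of lines 2 and 3.
    """
    blocks = ceil(n / p)
    order = []
    for j in range(1, blocks + p + 1):
        for k in range(1, p + 1):        # line 7: b_k of subsystem j-k+1
            sub = j - k + 1
            idx = p * (sub - 1) + k
            if 1 <= sub <= blocks and idx <= n:
                order.append(("b", idx))
        for k in range(1, p + 1):        # line 8: x_{p-k+1} of subsystem j-p
            sub = j - p
            idx = p * (sub - 1) + (p - k + 1)
            if 1 <= sub <= blocks and idx <= n:
                order.append(("x", idx))
    return order


order = pipeline_schedule(28, 4)
position = {item: t for t, item in enumerate(order)}
```

In this order, successive coefficient vectors of the same subsystem are separated by work on other subsystems, which gives each result time to leave the pipe before it is needed again.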

To point out the difference between Algorithms 1 and 3, the system
R<28,2> is redrawn in Fig. 5. The order of fetching from MM and
computing of coefficients and variables in PP is indicated by dotted
arrows. For example, the computation of $\mathbf{b}_{14}$ requires
$\mathbf{a}_{14}$, $\mathbf{b}_{13}$ and $\mathbf{e}_1$ to be fetched from
MM and entered into PP. It is impossible to proceed with the computation
of $\mathbf{b}_{15}$, since $\mathbf{b}_{14}$ will become available only
after a certain number of time steps. Therefore, in order to keep the
pipelined functional units busy, Algorithm 3 proceeds by computing
$\mathbf{b}_{11}$, $\mathbf{b}_8$, $x_4$, $x_3$, $x_2$, $x_1$ and
$\mathbf{b}_{21}$, $\mathbf{b}_{18}$. At that moment of time,
$\mathbf{b}_{14}$ should be available from MM or at the output of PP to be
used in the computation of $\mathbf{b}_{15}$.

Fig. 4.   A model of a pipelined computer

Theorem  4 

Given an integer p > 0, any linear recurrence system R<n,m> can be
solved in $T_s \le (\frac{n}{p} + p)(m + 2)$ time steps using pipelined
functional units with $s = (m + 2)p - (m + 1)$ stages.

Proof 

The proof is based on Algorithm 3. The system R<n,m> is divided into
subsystems $R^{(j)}$<p,m>, $1 \le j \le \lceil n/p \rceil$. To keep the
pipelined units busy almost all the time, p + 1 subsystems are being
solved concurrently, that is, different subsystems are using different
stages of the pipelined units.

Each $R^{(j)}$<p,m> system is solved in two parts:
$\mathbf{b}_k^{(j)} = \mathbf{a}_k^{(j)} * \mathbf{B}_{k-1}^{(j)}$,
$1 \le k \le p$, is computed first and
$x_{p-k+1}^{(j)} = \mathbf{b}_{p-k+1}^{(j)} * \mathbf{x}_p^{(j-1)T}$,
$1 \le k \le p$, is evaluated afterwards. With p + 1 systems computed
concurrently, the order of computation is as follows:

$$ \ldots,\ \mathbf{b}_1^{(j)}, \mathbf{b}_2^{(j-1)}, \ldots, \mathbf{b}_p^{(j-p+1)},\
   x_p^{(j-p)}, x_{p-1}^{(j-p)}, \ldots, x_1^{(j-p)}, $$
$$ \mathbf{b}_1^{(j+1)}, \mathbf{b}_2^{(j)}, \ldots, \mathbf{b}_p^{(j-p+2)},\
   x_p^{(j-p+1)}, x_{p-1}^{(j-p+1)}, \ldots, x_1^{(j-p+1)},\ \ldots $$

Therefore, at least (k-1) $\mathbf{b}_i$'s and k $x_i$'s must be computed
before their values are used in subsequent computations. Each
$\mathbf{b}_i$ is a vector containing m + 1 elements. Each element of
$\mathbf{b}_i$ is computed as a sum-of-products (sop). Similarly, each
$x_i$ is a sop. Hence, there are N = (m+1)(p-1) + p sop's altogether.
Each sop has m+1 products at most. This implies that the multiplication
and addition pipes are used m+1 times for each sop. The bottleneck of the
pipelined processor is obviously the addition pipe, since a new product
can be added to the corresponding partial sum only after one
addition-pipe step. Therefore, m+1 addition steps and one multiplication
step are needed to obtain p solutions of R<n,m>. The total time needed to
compute R<n,m> is less than $(\frac{n}{p} + p)(m + 2)$ pipeline processor
steps. The number of stages s in the addition pipe must be equal to N.
This way, a partial sum for some sop and a new product to be added to
that partial sum are generated at the same time by the addition and
multiplication pipes, respectively, and are ready to be entered into the
addition pipe again.

■ 

It is worth noting the desirability of a large $p \le \sqrt{n}$ (the
function $(\frac{n}{p} + p)(m + 2)$ has a minimum at $p = \sqrt{n}$).
However, a large p requires a large number of stages in the pipelined
functional units. Since the number of gate levels in an implementation
of floating-point multiply or add is fixed, the number of stages s will
hardly exceed 10-15. In present-day machines s is between 2 and 8.

Secondly, increased s increases the cost of functional units, since
extra latches or registers have to be added, which in turn introduces
extra delay, reducing the effect of pipelining.

The problem of finding the optimal number of stages s for a given p was
answered by Theorem 4. In reality, pipelined processors have fixed s and
the problem is in finding the minimal $T_s$. Without loss of generality,
we shall assume that all functional units have the same number of
stages.

Corollary  3 

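Given a pipelined processor with s-stage functional units, any R<n,m>
can be computed in

$$ T_s \le \min\left[ \left(\frac{n}{p'} + p'\right)(m + 2) ,\
   \frac{(m + 2)p'' - (m + 1)}{s}\left(\frac{n}{p''} + p''\right)(m + 2) \right] $$

where $p' = \left\lfloor \frac{s + m + 1}{m + 2} \right\rfloor$ and
$p'' = \left\lceil \frac{s + m + 1}{m + 2} \right\rceil$.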


Proof 


For a given p, N = (m + 2)p - (m + 1). If s = N, all stages are busy
performing some computation. If s > N, exactly s - N stages are idling
and waiting for the partial sums to become available at the output of
the addition pipe. The time $T_s$ required to compute R<n,m> is still the
same, $(\frac{n}{p} + p)(m + 2)$. However, the effectiveness
$E_s = T_s/s$ has increased. On the other hand, if s < N there is no
idling, but more pipeline time steps are required to compute R<n,m>.
Consequently, $T_s = \frac{N}{s}(\frac{n}{p} + p)(m + 2)$.

For a given s, it is possible to choose p such that $s \ge N$
($p \le (s + m + 1)/(m + 2)$) or s < N ($p > (s + m + 1)/(m + 2)$). In
either case a p that minimizes $T_s$ is sought. In the former case, the
function $(\frac{n}{p} + p)(m + 2)$ is monotonically decreasing for
positive p's, $p \le \sqrt{n}$, with the minimum at $p = \sqrt{n}$.
Assuming $p \ll n$, the minimum time is obtained for the largest p such
that $p \le (s + m + 1)/(m + 2)$. When s < N the function
$\frac{N}{s}(\frac{n}{p} + p)(m + 2)$ is monotonically increasing for
positive p's, and therefore the smallest p, $p > (s + m + 1)/(m + 2)$,
requires minimum time to compute R<n,m>.
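The choice between p' and p'' can be made mechanically for a given machine. The sketch below evaluates both candidates of Corollary 3 with exact rational arithmetic and returns the smaller time. The function names and the sample values of n, m and s are assumptions made for this illustration.

```python
from fractions import Fraction
from math import ceil, floor


def pipe_time(n, m, p, s):
    """(n/p + p)(m + 2), scaled by N/s when the pipe has fewer than N stages."""
    N = (m + 2) * p - (m + 1)
    base = (Fraction(n, p) + p) * (m + 2)
    return base if s >= N else Fraction(N, s) * base


def corollary3_bound(n, m, s):
    """Smaller of the two candidate times, using p' (floor) and p'' (ceiling)."""
    q = Fraction(s + m + 1, m + 2)
    p_lo, p_hi = floor(q), ceil(q)
    return min(pipe_time(n, m, p_lo, s), pipe_time(n, m, p_hi, s))
```

For n = 36, m = 2 and s = 9 the ratio (s + m + 1)/(m + 2) is exactly 3, so p' = p'' = 3 and N = s. With s = 8 the two candidates differ: p' = 2 leaves stages idle, while p'' = 3 needs N = 9 > s and pays the factor N/s.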
5.        Conclusion

The results of this paper are collected in Table 2. The exact bounds
have been approximated as closely as possible by short, compact
expressions.

The advantage of MIMD over SIMD systems is negligible for large p and
small m. The speedup of a MIMD machine over an SIMD machine increases
by less than 1 for small p. For example, for p = 2, m = 1 the speedups
are 7/5 and 8/7, respectively. However, for small p the advantage of a
parallel computer over a single-processor system is not earthshaking.

The pipelined computer seems to have the advantage for medium to large p
and moderate m (approximately $m \ge 4$). However, if the
performance-cost ratio is considered, the pipelined computer has the
advantage even for small m (m = 1, 2, 3). Let us assume that the cost of
an AE is 1. Then the cost of pipelining the same AE is 1.2-2.0,
depending on the number of stages. The extra cost is incurred because of
the additional hardware for storing partial results at each stage.
Therefore, the ratios of speedup and cost for parallel and pipelined
computers are

$$ \frac{S_p}{C_p} = \frac{2}{2m + 3} \ \text{(SIMD)} , \qquad
   \frac{S_p}{C_p} = \frac{2(1 + m/p + 1/2p)}{2m + 3} \ \text{(MIMD)} , \qquad
   \frac{S_s}{C_s} = \frac{\cdots}{m + 4 + 5/m} , $$

where $C_p = p$ and $C_s = 2$ are the costs of parallel and pipelined
arithmetic.

It should be noted that the extra delay introduced by writing and
reading of registers at each stage was neglected. That alone may degrade
the performance of the pipelined computer by 10%-50%, depending on the
number of stages. The advantage of the pipelined computer can be easily
explained by the inadequacy of the parallel model. It was assumed that
each AE is capable of executing one operation in one time step. However,
the hardware for multiplication and addition has only one logic unit in
common: the mantissa adder. Sharing it between multiplication and
addition may cost more than duplicating it. Therefore, instead of having
exactly one-half of the AE logic idle during each operation, an AE could
be designed to execute one multiplication and one addition
simultaneously. This way the speedup of the parallel model would double,
providing better performance than the pipelined model for all p and m.
Furthermore, the performance of the parallel computer can be increased
by increasing p. As was mentioned earlier, this is not possible with the
pipelined computer, since there is only a finite number of stages that
multiplication or addition logic will tolerate.

A disadvantage of the parallel model is the access of the Parallel
Memory, which has p input and p output ports. If it is thought of as a
two-dimensional array of words, it must allow parallel access to any p
elements in any row or column. This can be easily seen from Table 1.
For example, to compute $\mathbf{b}_3$, $\mathbf{b}_7$, $\mathbf{b}_{11}$
and $\mathbf{b}_{15}$ using four AEs, the PM must simultaneously deliver
$\mathbf{a}_3$, $\mathbf{a}_7$, $\mathbf{a}_{11}$ and $\mathbf{a}_{15}$.
After the computation, $\mathbf{b}_3$, $\mathbf{b}_7$, $\mathbf{b}_{11}$
and $\mathbf{b}_{15}$ are stored in place of $\mathbf{a}_3$,
$\mathbf{a}_7$, $\mathbf{a}_{11}$ and $\mathbf{a}_{15}$. Assuming four
memory modules, the upper half of Table 1 can be used as the storage
assignment for each memory module. However, when computing $x_9$,
$x_{10}$, $x_{11}$ and $x_{12}$, the PM must simultaneously deliver
$\mathbf{b}_9$, $\mathbf{b}_{10}$, $\mathbf{b}_{11}$ and $\mathbf{b}_{12}$,
which are unfortunately all in the same memory module. Therefore, a
memory scheme with skewed storage and input and output alignment
networks is required, further increasing the cost of a parallel system.

It should be mentioned at the end that real parallel and pipelined
computers have not been designed for solving linear recurrences, and
therefore the mapping of the algorithms presented in this paper onto the
available machines may present difficulty and lead to execution times
above the bounds given in Table 2.